Introduction

In this tutorial you will learn how to determine the distribution of your data. By the end of the tutorial you will be able to:

  • calculate the four moments
  • compare the distributions with the ideal ones
  • draw quantile plots and qqplots

Getting Started

We will visualize two datasets: financial series and earthquakes. To follow along you need:

  1. to download the earthquakes.csv and findata.xlsx files uploaded on LEARN
  2. some packages, including:
  • tidyverse
  • moments
  • qqtest

Reading and Preprocessing Data

Let’s read the findata and earthquakes datasets as below:
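The code chunk for this step is not shown; a minimal sketch, assuming the file names listed in Getting Started, would be:

```r
# Sketch of the reading step; file names are the ones uploaded on LEARN.
library(tidyverse)
library(readxl)

findata     <- read_excel("findata.xlsx")                        # daily prices
earthquakes <- read.csv("earthquakes.csv", stringsAsFactors = TRUE)
head(findata)
```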

## # A tibble: 6 x 7
##   Date        MSFT  AAPL    DJI `S&P`  Coca Pepsi
##   <chr>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-21  183.  317. 24474. 2949.  45.2  130.
## 2 2020-05-22  184.  319. 24465. 2955.  45.0  130.
## 3 2020-05-26  182.  317. 24995. 2992.  46.1  130.
## 4 2020-05-27  182.  318. 25548. 3036.  46.7  131.
## 5 2020-05-28  181.  318. 25401. 3030.  47.1  132.
## 6 2020-05-29  183.  318. 25383. 3044.  46.7  132.

The data contains information on four companies and two financial indices. We are interested in their daily returns, so we will define a calc_return function which calculates the return from today's and yesterday's prices, \(P_{n}/P_{n-1} - 1\):
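A minimal sketch of calc_return (the function named above); the toy prices are illustrative:

```r
# Return from consecutive prices: P_n / P_{n-1} - 1
calc_return <- function(price) {
  price[-1] / price[-length(price)] - 1
}

calc_return(c(100, 105, 102.9))   # 0.05, then -0.02
# On the findata tibble it would be applied column-wise, e.g.:
# returns <- sapply(findata[, -1], calc_return)
```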

##             MSFT         AAPL          DJI           S&P         Coca
## [1,]  0.03571208  0.055556561  0.022255692  0.0144088553  0.044417687
## [2,]  0.01725028 -0.004784381 -0.008857952 -0.0079476008 -0.020689655
## [3,] -0.02543175  0.033653825  0.007321911  0.0047300507  0.004694986
## [4,] -0.01739026 -0.013954671 -0.001072728 -0.0007633938  0.007009345
## [5,] -0.02654705  0.066037776  0.009111015  0.0039897580 -0.002320482
## [6,] -0.02727101 -0.022122490 -0.019775601 -0.0135283550 -0.038371875
##             Pepsi
## [1,]  0.039634249
## [2,] -0.019061631
## [3,]  0.005979122
## [4,] -0.002971792
## [5,] -0.007451517
## [6,] -0.030030054

The problem with financial series is that outliers reduce the quality of the visuals. For example, market crashes are very unlikely, but when they happen the daily return can be around -35%. We can remove such points using another function:
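One possible sketch of such a function; the cutoff of 5 standard deviations is an illustrative choice, not necessarily what the original code used:

```r
# Drop observations further than k standard deviations from the mean
remove_outliers <- function(x, k = 5) {
  x[abs(x - mean(x)) <= k * sd(x)]
}

# 99 ordinary days plus one crash day:
x <- c(rep(0.01, 99), -0.35)
length(remove_outliers(x))   # the crash is removed, 99 points remain
```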

## [1] 8449    6
##          MSFT         AAPL          DJI           S.P         Coca        Pepsi
## 1  0.03571208  0.055556561  0.022255692  0.0144088553 -0.020689655  0.039634249
## 2  0.01725028 -0.004784381 -0.008857952 -0.0079476008  0.004694986 -0.019061631
## 3 -0.02543175  0.033653825  0.007321911  0.0047300507  0.007009345  0.005979122
## 4 -0.01739026 -0.013954671 -0.001072728 -0.0007633938 -0.002320482 -0.002971792
## 5 -0.02654705  0.066037776  0.009111015  0.0039897580  0.014510277 -0.007451517
## 6 -0.02727101 -0.022122490 -0.019775601 -0.0135283550 -0.008343189 -0.030030054

The second dataset we will use is earthquakes:

## 'data.frame':    23412 obs. of  21 variables:
##  $ Date                      : Factor w/ 12401 levels "01/01/1967","01/01/1969",..: 35 102 134 236 268 301 368 465 501 536 ...
##  $ Time                      : Factor w/ 20472 levels "00:00:03","00:00:04",..: 11785 9866 15448 16088 11612 11666 11604 19888 9901 9195 ...
##  $ Latitude                  : num  19.25 1.86 -20.58 -59.08 11.94 ...
##  $ Longitude                 : num  145.6 127.4 -174 -23.6 126.4 ...
##  $ Type                      : Factor w/ 4 levels "Earthquake","Explosion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Depth                     : num  132 80 20 15 15 ...
##  $ Depth.Error               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Depth.Seismic.Stations    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude                 : num  6 5.8 6.2 5.8 5.8 6.7 5.9 6 6 5.8 ...
##  $ Magnitude.Type            : Factor w/ 11 levels "","MB","MD","MH",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Magnitude.Error           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude.Seismic.Stations: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Azimuthal.Gap             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Distance       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Error          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Root.Mean.Square          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ID                        : Factor w/ 23412 levels "AK11232962","AK11248623",..: 2580 2581 2582 2583 2584 2585 2586 2587 2711 2588 ...
##  $ Source                    : Factor w/ 13 levels "AK","ATLAS","CI",..: 5 5 5 5 5 5 5 5 6 5 ...
##  $ Location.Source           : Factor w/ 48 levels "AEI","AEIC","AG",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ Magnitude.Source          : Factor w/ 24 levels "1000","1009",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ Status                    : Factor w/ 2 levels "Automatic","Reviewed": 1 1 1 1 1 1 1 1 1 1 ...

The dates in this data are read as factors, but we need them as dates. We will also need a Datetime column to calculate the time difference between two earthquakes, and a YearMonth column to count the number of events in each month:
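A sketch of this preprocessing on two toy rows (the real code operates on the full data frame; the formats are inferred from the str() output above):

```r
# Toy rows standing in for the CSV contents
earthquakes <- data.frame(Date = c("01/02/1965", "01/04/1965"),
                          Time = c("13:44:18", "11:29:49"))

earthquakes$Date     <- as.Date(earthquakes$Date, format = "%m/%d/%Y")
earthquakes$Datetime <- as.POSIXct(paste(earthquakes$Date, earthquakes$Time),
                                   tz = "UTC")
# unpadded month, matching the "1965-1" values shown below
earthquakes$YearMonth <- paste(format(earthquakes$Date, "%Y"),
                               as.integer(format(earthquakes$Date, "%m")),
                               sep = "-")
earthquakes$YearMonth   # "1965-1" "1965-1"
```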

## [1] 23412    23
##         Date     Time Latitude Longitude       Type Depth Depth.Error
## 1 1965-01-02 13:44:18   19.246   145.616 Earthquake 131.6          NA
## 2 1965-01-04 11:29:49    1.863   127.352 Earthquake  80.0          NA
## 3 1965-01-05 18:05:58  -20.579  -173.972 Earthquake  20.0          NA
## 4 1965-01-08 18:49:43  -59.076   -23.557 Earthquake  15.0          NA
## 5 1965-01-09 13:32:50   11.938   126.427 Earthquake  15.0          NA
## 6 1965-01-10 13:36:32  -13.405   166.629 Earthquake  35.0          NA
##   Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1                     NA       6.0             MW              NA
## 2                     NA       5.8             MW              NA
## 3                     NA       6.2             MW              NA
## 4                     NA       5.8             MW              NA
## 5                     NA       5.8             MW              NA
## 6                     NA       6.7             MW              NA
##   Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1                         NA            NA                  NA               NA
## 2                         NA            NA                  NA               NA
## 3                         NA            NA                  NA               NA
## 4                         NA            NA                  NA               NA
## 5                         NA            NA                  NA               NA
## 6                         NA            NA                  NA               NA
##   Root.Mean.Square           ID Source Location.Source Magnitude.Source
## 1               NA ISCGEM860706 ISCGEM          ISCGEM           ISCGEM
## 2               NA ISCGEM860737 ISCGEM          ISCGEM           ISCGEM
## 3               NA ISCGEM860762 ISCGEM          ISCGEM           ISCGEM
## 4               NA ISCGEM860856 ISCGEM          ISCGEM           ISCGEM
## 5               NA ISCGEM860890 ISCGEM          ISCGEM           ISCGEM
## 6               NA ISCGEM860922 ISCGEM          ISCGEM           ISCGEM
##      Status            Datetime YearMonth
## 1 Automatic 1965-01-02 13:44:18    1965-1
## 2 Automatic 1965-01-04 11:29:49    1965-1
## 3 Automatic 1965-01-05 18:05:58    1965-1
## 4 Automatic 1965-01-08 18:49:43    1965-1
## 5 Automatic 1965-01-09 13:32:50    1965-1
## 6 Automatic 1965-01-10 13:36:32    1965-1

Continuous Distributions

There are a number of continuous distributions, each describing a different kind of random variable:

Normal

\[X \sim N(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12(\frac{x-\mu}{\sigma})^2}\]

Exponential

\[X \sim \text{Exp}(\lambda) \Rightarrow f(x) = \lambda e^{-\lambda x}, \quad x \ge 0\]

Uniform
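Its density is constant on the interval, for reference:

\[X \sim U(a,b) \Rightarrow f(x) = \frac{1}{b-a}, \quad a \le x \le b\]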

Log-Normal
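A variable whose logarithm is Normal; its density, for reference:

\[X \sim LN(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}, \quad x > 0\]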

Beta
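Defined on \([0,1]\), with shape parameters \(\alpha\) and \(\beta\):

\[X \sim Beta(\alpha,\beta) \Rightarrow f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad 0 \le x \le 1\]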

Gamma
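With shape \(\alpha\) and rate \(\beta\):

\[X \sim Gamma(\alpha,\beta) \Rightarrow f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0\]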

Weibull
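With shape \(k\) and scale \(\lambda\):

\[X \sim Weibull(k,\lambda) \Rightarrow f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, \quad x \ge 0\]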

Quantile Plots - Chef’s Knife

Identifying Distributions

Moments - Measures for Distribution Characteristics

There are two moments that are important in life. The moment you are born and the moment you find out why… Never mind.

There are four moments that are important in summarizing characteristics of distributions. These are first, second, third and fourth moments which correspond to:

  1. Mean: The average
  2. Variance: Square of standard deviation
  3. Skewness: The asymmetry
  4. Kurtosis: Thickness of tails

You will learn about moments, moment generating functions and so on in detail later. Base R doesn't compute them all for you, so we will use the moments package.

The moments take specific values for each distribution. For example, the skewness and kurtosis of the Normal are 0 and 3 respectively. For other distributions you will sometimes see excess kurtosis, which is the kurtosis benchmarked against that of the Normal distribution:

  • skewness and ex. kurtosis of the Exponential are 2 and 6 respectively.
  • skewness and ex. kurtosis of the Uniform are 0 and -6/5 (no kidding) respectively.
  • skewness and ex. kurtosis of the Poisson are \(\lambda^{-1/2}\) and \(\lambda^{-1}\)

You can check these on Wikipedia by searching for the distribution; the table on the right-hand side summarizes these properties.

We will now generate random variables and calculate their moments. We also define a fourmoments function to summarize the four moments:
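A sketch of fourmoments (the function named above), assuming the moments package:

```r
library(moments)   # skewness() and kurtosis()

fourmoments <- function(x) {
  c(Mean = mean(x), Variance = var(x),
    Skewness = skewness(x), Kurtosis = kurtosis(x))
}

fourmoments(rnorm(10000))   # close to 0, 1, 0, 3
```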

##        Mean    Variance    Skewness    Kurtosis 
## -0.01634422  1.00645313 -0.04879710  2.95845771

We will use apply to run the function on all columns:
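A sketch of the apply step; the distribution parameters are inferred from the sample moments in the output below, so they are guesses, and fourmoments (named in the text) is repeated here for completeness:

```r
library(moments)
fourmoments <- function(x) {
  c(Mean = mean(x), Variance = var(x),
    Skewness = skewness(x), Kurtosis = kurtosis(x))
}

X <- cbind(Normal      = rnorm(10000),
           Exponential = rexp(10000, rate = 3),   # mean 1/3, as in the output
           Uniform     = runif(10000),
           Beta        = rbeta(10000, 6, 4))      # mean 0.6, as in the output
apply(X, 2, fourmoments)   # a 4 x 4 table of moments
```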

##               Normal Exponential    Uniform        Beta
## Mean     -0.01634422   0.3266587 0.50267007  0.60474814
## Variance  1.00645313   0.1046882 0.08356373  0.02025603
## Skewness -0.04879710   2.1312615 0.02188583 -0.26241843
## Kurtosis  2.95845771  10.1662499 1.79866096  2.78964164

The four moments are close to the ideal ones, with small deviations. For example, the kurtosis of the Normal sample is 2.96 as opposed to 3, while the Uniform is 1.80, matching the ideal 3 - 6/5 = 1.8. The deviation is larger for the Exponential: the ideal kurtosis is 9 (excess kurtosis 6), whereas we see about 10. This is still not a big difference, because a small number of unlikely draws can make your data leptokurtic (thick-tailed).

If we generate larger samples, the estimates get closer to the ideal moments:

##                Normal Exponential     Uniform        Beta
## Mean     -0.009463987   0.3343949  0.50011903  0.59707382
## Variance  1.005879410   0.1141077  0.08367487  0.02166831
## Skewness -0.004849413   2.0193060 -0.01112597 -0.17851546
## Kurtosis  2.933946209   8.9445674  1.80793466  2.55468255

Quantile Comparisons

On the left we can see the two distributions back to back. Microsoft returns are more compressed toward the center, whereas Apple returns have thicker tails. This can be read from the quantile plots: between returns of -0.02 and 0, the Microsoft quantiles are flatter than Apple's; between -0.04 and -0.02, Apple is flatter. The same holds for returns greater than 0. Apple is riskier than Microsoft.

Pepsi and Coca-Cola are less risky than the tech companies, and their distributions are quite close to each other. What do the moments say?

##            MSFT   AAPL     DJI     S.P   Coca  Pepsi
## Mean     0.0011 0.0011  0.0004  0.0004 0.0005 0.0005
## Variance 0.0003 0.0005  0.0001  0.0001 0.0002 0.0002
## Skewness 0.1583 0.1245 -0.1670 -0.1900 0.1177 0.0708
## Kurtosis 3.7222 3.6565  3.9586  4.0470 3.6895 3.7874

On average, the tech companies have higher returns. Their variance (risk squared) is also higher, with Apple's risk the highest. The four companies are right skewed, meaning the right tail is longer (most observations lie to the left of the mean), while the two indices are slightly left skewed. All the series are leptokurtic; the tails are thick.

The sample size is large. In practice we don't always have this much data. Let's check the last 300 days:

Apparently Apple continues to be risky, in both the positive and negative directions.

Let’s focus on Microsoft. Is it Normal? To check it we will generate Normal numbers with the same mean and variance:
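A sketch of the overlay, with a placeholder vector standing in for the real Microsoft returns:

```r
set.seed(7)
msft  <- rt(2000, df = 4) * 0.01                    # heavy-tailed placeholder
ideal <- rnorm(length(msft), mean(msft), sd(msft))  # Normal, same mean & sd

p <- ppoints(length(msft))   # probability points
plot(p, sort(ideal), type = "l", xlab = "Probability", ylab = "Quantile")
points(p, sort(msft), col = "blue", pch = 20)
```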

The black line is the ideal distribution. For positive returns the blue points closely follow the black line, but for negative returns they deviate more. In the negative tail (-0.050 to -0.025) the blue points sit above the black line, which means extreme negative returns are more likely than the Normal distribution predicts.

qqplots

Comparing points with their ideal distribution effectively shows where the numbers deviate, and how many of them do. Quantile-quantile plots are built on the same idea: instead of plotting both sets of quantiles against probability points and overlapping them, we drop the probability points and put the quantiles of one series on the x-axis and the other on the y-axis. For a qqplot you can also use the standard normal; the only difference will be the x-axis tick labels:

The good news is that we don't have to write much code for a qqplot if we want to check normality. We can use qqnorm and qqline as below:
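For example (with a placeholder vector in place of the Microsoft returns):

```r
set.seed(7)
msft <- rt(2000, df = 4) * 0.01   # heavy-tailed placeholder

qqnorm(msft)   # sample quantiles against standard-normal quantiles
qqline(msft)   # reference line through the first and third quartiles
```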

The points deviate from the black line in the tails. In the right tail they stay quite close to the line, but in the left tail they deviate considerably.

qqtest Package

Our Statistics faculty is working hard to help you engineers to visualize things. Here is a package written by a University of Waterloo professor.

In the previous plot we saw points deviating: some far, some not; some few, some many. We can add an envelope to check which deviations are statistically tolerable:
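A sketch, assuming the qqtest() function with its dist argument (check ?qqtest for the exact option names):

```r
library(qqtest)
set.seed(7)
msft <- rt(2000, df = 4) * 0.01   # heavy-tailed placeholder

# qq-plot against the Normal, with a simulation envelope
qqtest(msft, dist = "gaussian")
```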

The left-tail points deviate beyond the tolerable range. This means we can reject that the Microsoft daily return series is Normal, especially because of its left-tail behaviour.

In finance theory, risk should be a chi-squared random variable. To check this, we can calculate the squared deviations from the mean and check whether they are chi-squared:
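One way to frame this check: for Normal data the squared standardized deviations are chi-squared with 1 degree of freedom. A sketch with simulated data:

```r
set.seed(7)
x    <- rnorm(2000)                    # stand-in for the return series
dev2 <- ((x - mean(x)) / sd(x))^2      # squared standardized deviations

# compare the sample quantiles with chi-squared(1) quantiles
qqplot(qchisq(ppoints(length(dev2)), df = 1), sort(dev2),
       xlab = "Chi-squared(1) quantiles", ylab = "Sample quantiles")
abline(0, 1)
```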

It is hard to say it is chi-squared.

The data is not exponential either. But by tweaking a little, we could fit a Weibull distribution.

Earthquake Data

Number of Earthquakes

Now we can ask whether the number of earthquakes per month is Poisson and the time between two events is exponential. To this end we will aggregate the earthquakes data, focusing on the destructive events, i.e. Magnitude >= 6.5:
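A sketch of the aggregation on toy rows (the real code runs on the full preprocessed data frame):

```r
# Toy frame standing in for the preprocessed data
earthquakes <- data.frame(
  YearMonth = c("1965-1", "1965-1", "1965-2", "1965-2", "1965-2"),
  Magnitude = c(6.0, 6.7, 6.5, 7.1, 5.8))

big    <- earthquakes[earthquakes$Magnitude >= 6.5, ]  # destructive only
counts <- as.vector(table(big$YearMonth))              # events per month
counts   # 1 in 1965-1, 2 in 1965-2
```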

The data is pretty much Poisson, with some differences: there are no months with 0 earthquakes, and there is a pile-up around n = 10. These differences might arise because we are using international data, so there is a lot of variety between locations and their geological structures.

##     Mean Variance Skewness Kurtosis 
## 3.841137 5.584436 1.274698 5.139980

The moments differ from the ideal ones, where the four moments should be \(\lambda,\ \lambda,\ \lambda^{-1/2},\ 3 + \lambda^{-1}\).

Days Passed

Now we can calculate the time difference between each two events:
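diff() on the Datetime column gives the gaps directly; a sketch with toy timestamps:

```r
dt <- as.POSIXct(c("1965-01-02 13:44:18", "1965-01-04 11:29:49",
                   "1965-01-05 18:05:58"), tz = "UTC")

# gaps between consecutive events, converted to days
days_passed <- as.numeric(diff(dt), units = "days")
days_passed   # roughly 1.91 and 1.28 days
```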

Pretty much overlapping, except the probability is a bit higher in the tails.

##      Mean  Variance  Skewness  Kurtosis 
##  8.265258 88.936180  2.146942  9.893428

The qqplot with envelopes also shows strong evidence that the distribution is exponential. Besides, the skewness and kurtosis are 2.15 and 9.89, close to the ideal values of 2 and 9.

Magnitude

What about the magnitudes of each earthquake?

Deliverables

1

  1. Choose a continuous variable (not discrete) and plot the histogram:
  • change fill and/or colour to make your plot fancier
  • interpret the graph: is it left or right skewed?
  • are the tails irregularly thick in certain ranges?
  2. Also calculate the 4 moments and interpret them:
  • Is it left or right skewed?
  • Is it platykurtic or leptokurtic?

2

Choose a continuous variable (not discrete), or generate one:

  • Plot the sample quantiles against probability points (probs)
  • Is it unimodal or multimodal?
  • What are the first, second (median) and third quartiles? Show how they can be located on the quantile plot.
  • What are the min and max values?
  • Are there more points before (on the left of) the median or after (on the right of) the median?

3

If your data is left skewed, find or create another dataset (which is right skewed or not skewed). Fit a distribution to the same continuous variable:

  • Use qqtest to check whether your data is normal, lognormal, chi-squared or exponential
  • Plot the sample quantiles against probability points (probs). Generate ideal quantiles and add them as a line on the quantile plot.